
feat: porting ngsderive #35

Open · a-frantz wants to merge 91 commits into main
Conversation

@a-frantz (Member) commented Dec 4, 2023

Porting https://github.com/stjudecloud/ngsderive to this repo.

An attempt was made to keep the algorithms in this port as consistent with the original repo as possible. However, some tweaks and improvements were made, which are detailed below. The biggest difference between the Python code and the Rust code is the output: ngsderive reported TSVs with fairly limited information, while this code reports results as JSON and outputs comprehensive metrics related to processing, making debugging and assessment easier. I'm not going to detail all the new metrics here; check out the code for that.

  • encoding
    • no significant changes
  • endedness
    • Started checking the is_segmented bitwise flag, which led to a rework of Single-End classification (see the endedness sketch after this list).
      • if is_segmented == false, we DO NOT check is_first or is_last
      • if all reads have is_segmented == false, we call the data Single-End
        • (the optional condition of RPT == 1.0 for classifying as Single-End still applies)
      • we no longer expect SE data to have BOTH the first and last flags set
      • Paired-End logic is essentially the same, but now we require that every read has is_segmented set
    • In Rust, we aren't using any fancy data structure for QNAMEs (in Python we use a prefix tree). I tried a prefix tree in Rust and it horribly tanked performance; I don't know why, since in theory it should help. I gave up on optimizing this command; we can always come back to it.
  • instrument:
    • Was already implemented and is not significantly changed by this PR, but there were some tweaks so it behaves similarly to the other (new) derive commands.
    • now reports per-RG results
    • On paper, this was a multi-threaded implementation. That really confused me, because it isn't actually multi-threaded. So I stripped away the async wrapping and made it clear that it's single-threaded.
    • I added some additional metric reporting.
    • I added a record counter that logs every 1 million reads (a minimal sketch follows this list).
    • No other changes of import.
  • junction-annotation:
    • The biggest change, which also impacted strandedness: MAPQ filtering.
      • This Simon Andrews article is a good rundown of the various ways MAPQs are (erroneously) used in bioinformatics: https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/
      • Notably, STAR sets a MAPQ of 255 for unique alignments, while the spec (and therefore noodles) interprets 255 as the special value meaning "MAPQ is missing".
      • Because of this, noodles never returns a MAPQ of 255; pysam (in Python) does return values of 255.
      • In ngsderive, the default MAPQ filter is 30. On our STAR data, this meant "only look at unique mappers (discard multi-mappers)". That doesn't work in Rust when we're using noodles.
      • My resolution was to default to "no MAPQ filter", which in practice on STAR data means looking at both unique mappers and multi-mappers.
      • If a user wants to enable a MAPQ filter, they can; it will just block any records with a raw MAPQ of 255 (see the MAPQ sketch after this list).
    • This implementation readily lends itself to being adopted by the junction-saturation algorithm (unlike the Python code, which was complicated by a reliance on pysam), so we can implement that pretty easily in the future.
  • readlen:
    • nothing significant
  • strandedness:
    • Same MAPQ filter problem and solution as junction-annotation.
    • As I wrote this, I discovered more and more problems with the Python code, some big, some small, but hopefully all "fixed" here.
    • I think this had the largest impact on results: there was no (working) filter for genes with mixed-strand exons. There was code for disqualifying these cases; it just didn't work in Python.
    • Now, every gene is checked to be on either the Forward or Reverse strand, and then all overlapping exons are checked to be on that same strand (see the strand-filter sketch after this list). This check disqualifies roughly half of the protein-coding genes in Gencode release v32 (what I tested with). That might sound like a lot of genes to disqualify, but I think it checks out for two reasons.
      1. Results are significantly better with this filter than without. I wasn't rigorous and didn't do any stats, but by my eyeball, with the strand filter I was getting results in the >99% reverse-evidence range, while without it the same samples returned reverse evidence in the ~90% range.
      2. A Nature article gets similar-ish numbers with a similar-ish analysis: https://www.nature.com/articles/s41598-019-49802-w#:~:text=Accordingly%2C%20we%20used%20the%20current,overlapped%20with%20their%20adjacent%20genes (discovered courtesy of Andrew T.).
    • Another filter that didn't work in Python: --min-reads-per-gene. Everything was being counted as evidence, so the filter didn't do anything; it now works. I didn't notice any swing in the results when I played with disabling it (at least at a value of 10 vs. 0), so I don't know how much it really impacts results. Maybe jacking it up would change things? I haven't investigated.
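
For reference, here's a minimal sketch of the reworked endedness call described above. The counter struct and function are hypothetical (they stand in for this PR's actual bookkeeping), and the flag names mirror SAM flag semantics rather than the exact noodles API:

```rust
/// Hypothetical tallies gathered while iterating over records.
struct FlagCounts {
    segmented: u64,   // reads with is_segmented set
    unsegmented: u64, // reads with is_segmented unset
    first: u64,       // is_first (only inspected when is_segmented is set)
    last: u64,        // is_last (only inspected when is_segmented is set)
    both: u64,        // is_first && is_last
    neither: u64,     // !is_first && !is_last
}

/// Returns the endedness call, or `None` if the evidence is inconclusive.
fn call_endedness(c: &FlagCounts, reads_per_template: Option<f64>) -> Option<&'static str> {
    // Single-End: every read is unsegmented; is_first/is_last are never
    // consulted. RPT == 1.0 is an optional extra condition.
    if c.segmented == 0 && c.unsegmented > 0 {
        if reads_per_template.map_or(true, |rpt| rpt == 1.0) {
            return Some("Single-End");
        }
        return None;
    }
    // Paired-End: every read is segmented, the first/last counts balance,
    // and no read carries both or neither of the first/last flags (the
    // balance conditions are carried over from the original ngsderive logic
    // as I understand it).
    if c.unsegmented == 0 && c.first == c.last && c.both == 0 && c.neither == 0 {
        return Some("Paired-End");
    }
    None
}
```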
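
The record counter mentioned under instrument is simple enough to sketch in a few lines. Logging via the log crate is an assumption here; any logging facility works:

```rust
/// A minimal progress counter that logs a line every 1,000,000 records.
struct RecordCounter {
    count: u64,
}

impl RecordCounter {
    fn new() -> Self {
        Self { count: 0 }
    }

    /// Call once per record processed.
    fn inc(&mut self) {
        self.count += 1;
        if self.count % 1_000_000 == 0 {
            log::info!("Processed {} records.", self.count);
        }
    }
}
```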
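
Here's a sketch of the MAPQ filter behavior described under junction-annotation. Since noodles surfaces a raw MAPQ of 255 as a missing value, Option<u8> models it here; the function name and signature are illustrative, not the PR's actual API:

```rust
/// Decide whether a record passes the (optional) MAPQ filter.
///
/// `mapq` is `None` when the raw value was 255 (the "missing" sentinel),
/// which is how noodles surfaces it. `min_mapq` is `None` when no filter
/// is configured, which is the new default.
fn passes_mapq_filter(mapq: Option<u8>, min_mapq: Option<u8>) -> bool {
    match min_mapq {
        // Default: no filter. Unique mappers and multi-mappers both pass,
        // including records whose MAPQ is the missing sentinel.
        None => true,
        // Filter enabled: a missing MAPQ (raw 255) can't be compared against
        // the threshold, so it is blocked.
        Some(min) => matches!(mapq, Some(q) if q >= min),
    }
}
```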
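
Finally, a sketch of the strandedness strand-consistency filter, with hypothetical types standing in for the PR's actual gene/exon handling:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Strand {
    Forward,
    Reverse,
}

/// Hypothetical stand-ins for the parsed GFF features.
struct Gene {
    strand: Option<Strand>, // `None` for unstranded or ambiguous entries
}

struct Exon {
    strand: Option<Strand>,
}

/// A gene is kept only if it has an unambiguous strand AND every overlapping
/// exon agrees with that strand. `overlapping_exons` is assumed to come from
/// an interval lookup elsewhere in the command.
fn gene_is_usable(gene: &Gene, overlapping_exons: &[Exon]) -> bool {
    match gene.strand {
        Some(strand) => overlapping_exons
            .iter()
            .all(|exon| exon.strand == Some(strand)),
        None => false, // mixed/unknown strand: disqualified
    }
}
```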

Before submitting this PR, please make sure:

  • You have added a few sentences describing the PR here.
  • You have added yourself or the appropriate individual as the assignee.
  • You have added at least one relevant code reviewer to the PR.
  • You have added any relevant tags to the pull request.
  • Your code builds clean without any errors or warnings (use cargo test and cargo clippy).
  • You have added tests (when appropriate).
  • You have updated the wiki (when appropriate).
  • You have updated the README or other documentation to account for these changes (when appropriate).

@a-frantz self-assigned this Dec 4, 2023
@a-frantz added the feature (Introduces a new feature to the codebase) label Dec 4, 2023
@a-frantz marked this pull request as ready for review February 22, 2024 17:10
@claymcleod (Member) left a comment

Some intermediate comments (some are old) from my last review. I want to get these in front of you so you can address them before squashing or whatever you plan to do with the commits here.

```diff
@@ -20,6 +25,25 @@ pub struct DeriveArgs {
 /// All possible subcommands for `ngs derive`.
 #[derive(Subcommand)]
 pub enum DeriveSubcommand {
     /// Derives the quality score encoding used to produce the file.
     Encoding(self::encoding::DeriveEncodingArgs),
```
FYI, we're going to change how this is named slightly in the future.
